jdesouzarangel7235@floridapoly.edu
I explored the dataset to understand the distribution of players, their ratings, and how various attributes contribute to a player’s overall rating. I used a combination of tidyverse, sf, plotly, and modeling libraries in R.
When we think of top players in FIFA 18, stars like Messi, Ronaldo, and Neymar instantly come to mind. But what if some of the most efficient and underrated talents are buried deep in the dataset?
I wanted to understand the relationship between players and their stats. Certain countries are famous for soccer: do they consistently produce good players, or are there just a few lucky outliers? Does the data suggest that they care about and invest more in the sport? Do top players have consistently similar ratings? What makes a good rating?
Initially, the goal was to create the following visualizations:
Interactive Plot: Scatter plot of Actual vs. Predicted Overall Ratings, with player-specific hover info.
Spatial Visualization: World map showing average player rating per country.
Model Visualization: Coefficients plot of a linear regression model predicting overall from player attributes.
I had some difficulties making the predictor and standardizing it, as well as making the graphs informative but not cluttered.
# Load required packages
library(tidyverse)
library(sf)
library(plotly)
library(rnaturalearth)
library(rnaturalearthdata)
##
## Attaching package: 'rnaturalearthdata'
## The following object is masked from 'package:rnaturalearth':
##
## countries110
library(ggthemes)
# Read the dataset
fifa18 <- read_csv("./fifa18.csv")
## Rows: 17076 Columns: 40
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, nationality, club
## dbl (37): age, overall, potential, acceleration, aggression, agility, balanc...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the data
fifa18
## # A tibble: 17,076 × 40
## name nationality club age overall potential acceleration aggression
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cristiano … Portugal Real… 32 94 94 89 63
## 2 L. Messi Argentina FC B… 30 93 93 92 48
## 3 Neymar Brazil Pari… 25 92 94 94 56
## 4 L. Suárez Uruguay FC B… 30 92 92 88 78
## 5 M. Neuer Germany FC B… 31 92 92 58 29
## 6 R. Lewando… Poland FC B… 28 91 91 79 80
## 7 De Gea Spain Manc… 26 90 92 57 38
## 8 E. Hazard Belgium Chel… 26 90 91 93 54
## 9 T. Kroos Germany Real… 27 90 90 60 60
## 10 G. Higuaín Argentina Juve… 29 90 90 78 50
## # ℹ 17,066 more rows
## # ℹ 32 more variables: agility <dbl>, balance <dbl>, ball_control <dbl>,
## # composure <dbl>, crossing <dbl>, curve <dbl>, dribbling <dbl>,
## # finishing <dbl>, free_kick_accuracy <dbl>, gk_diving <dbl>,
## # gk_handling <dbl>, gk_kicking <dbl>, gk_positioning <dbl>,
## # gk_reflexes <dbl>, heading_accuracy <dbl>, interceptions <dbl>,
## # jumping <dbl>, long_passing <dbl>, long_shots <dbl>, marking <dbl>, …
When analyzing the FIFA 18 player dataset, one way to gauge a country’s investment in soccer is by looking at both the quantity and quality of players it produces. Countries like Spain, Germany, Brazil, and France stand out not only for having a high number of registered players in the game but also for their consistently high average player ratings. Does higher investment translate into top talent?
Data Preparation
Player nationality data was extracted from the fifa18 dataset.
A count of players was computed for each nationality.
A world map shapefile was loaded using the rnaturalearth package.
The player counts were joined to the map data by matching the country names in both datasets.
Missing values were replaced with 0.
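Joining on country names only works when both sources spell a name identically. Any nationality without an exact match is silently dropped by the left join, and the corresponding country then shows up with 0 players on the map. A minimal sketch of how such mismatches could be checked, using small made-up tables rather than the real data (the values, `map_names`, and the `name_fixes` lookup are illustrative assumptions, not taken from fifa18.csv or rnaturalearth):

```r
library(dplyr)

# Toy stand-ins with hypothetical values
player_counts <- tibble(
  nationality = c("Brazil", "England", "United States"),
  num_players = c(10L, 20L, 5L)
)
map_names <- tibble(
  name = c("Brazil", "United Kingdom", "United States of America")
)

# Nationalities that have no exact match among the map names
unmatched <- anti_join(player_counts, map_names,
                       by = c("nationality" = "name"))
print(unmatched)

# Assumed manual lookup: dataset spelling -> map spelling
name_fixes <- c("United States" = "United States of America")

# Rename where a fix exists, otherwise keep the original spelling
player_counts_fixed <- player_counts %>%
  mutate(nationality = coalesce(unname(name_fixes[nationality]), nationality))

# Whatever is still unmatched needs a decision (here, England has no
# one-to-one counterpart among the toy map names)
still_unmatched <- anti_join(player_counts_fixed, map_names,
                             by = c("nationality" = "name"))
print(still_unmatched)
```

Running the same `anti_join()` against the real `player_counts` and `world` objects would list which nationalities, if any, are missing from the map.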
Visualization
Countries are color filled based on the number of players using a square-root transformation to improve visual differentiation. The tooltip indicates country and number of players.
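When a few countries have far more players than most, a linear color scale leaves most of the map in nearly the same shade. The effect of the square-root transformation can be illustrated with a handful of made-up counts (these numbers are illustrative, not taken from the dataset):

```r
# Hypothetical player counts spanning a wide range
counts <- c(1, 25, 400, 1600)

# Approximate position on a 0-1 color scale, without and with the
# square-root transformation
linear_pos <- counts / max(counts)
sqrt_pos <- sqrt(counts) / sqrt(max(counts))

print(round(linear_pos, 3))
print(sqrt_pos)
```

On the linear scale the three smallest counts all sit in the bottom quarter of the color range, while the square-root scale spreads them across the bottom half, which makes mid-sized countries easier to tell apart.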
Interpretation
The resulting map shows which countries have the highest number of FIFA 18 players.
As we can see, Brazil, Germany, Argentina, Spain, and France are among the countries with the most players. This was predictable, as soccer is well known as a dominant sport in their cultures.
################################################################################
# Data preparation
################################################################################
# Count players by nationality
player_counts <- fifa18 %>%
  count(nationality, name = "num_players")
# Load world map
world <- ne_countries(scale = "medium", returnclass = "sf")
# Join player counts with world map by matching nationality to map's name column
world_players <- world %>%
  left_join(player_counts, by = c("name" = "nationality"))
# Replace NAs with 0
world_players$num_players[is.na(world_players$num_players)] <- 0
################################################################################
# Plot
################################################################################
numberOfPlayersPerCountryPlot <- ggplot(world_players) +
  geom_sf(aes(fill = num_players,
              text = paste(name,
                           "<br>Players:", num_players)),
          color = "darkgray") +
  scale_fill_viridis_c(option = "D", trans = "sqrt", na.value = "darkgray") +
  labs(
    title = "Number of FIFA 18 Players by Country",
    fill = "Player Count"
  ) +
  theme_minimal()
## Warning in layer_sf(geom = GeomSf, data = data, mapping = mapping, stat = stat,
## : Ignoring unknown aesthetics: text
# Display plot
ggplotly(numberOfPlayersPerCountryPlot, tooltip = "text")
Countries like Argentina, Germany, Spain, and Brazil stand out, predictably. Elite talent tends to cluster, especially in countries with a long-standing tradition of excellence in the sport. Argentina and Portugal, for example, are home to global icons like Lionel Messi and Cristiano Ronaldo, who are among the highest-rated players in the game. Germany, Spain, Brazil, and France also feature prominently, each contributing top-tier players across various positions, from attackers to goalkeepers. This concentration reflects not just individual brilliance but the strength of national soccer systems that consistently produce world-class talent. A high number of players combined with high average ratings suggests that these countries consistently produce, and invest in, top talent. Yet excellent players also emerge in countries with less tradition, such as Egypt and Algeria.
Objective
Find the average overall rating of soccer players by nationality.
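Data Preparation
The average overall rating was computed for each nationality, keeping only nationalities with at least 5 players.
A world map shapefile was loaded using the rnaturalearth package.
The average ratings were joined to the map data by matching the country names in both datasets.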
Visualization
A map shows countries shaded according to the average player rating. The tooltip shows the country name and average player rating.
Interpretation
Countries with a higher average player rating (like Spain, Germany, and Argentina) are highlighted. These mostly match the number of players, indicating that they not only have a high number of players, but specifically a high number of talented players.
################################################################################
# Data
################################################################################
# Compute average player rating by nationality (only if more than a few players)
avg_rating_by_country <- fifa18 %>%
  group_by(nationality) %>%
  filter(n() >= 5) %>% # Avoid skewed averages for small countries: keep nationalities with at least 5 players
  summarize(avg_overall = mean(overall, na.rm = TRUE))
# Load world map
world_shapes <- ne_countries(scale = "medium", returnclass = "sf")
# Join world map with average ratings
world_ratings <- world_shapes %>%
  left_join(avg_rating_by_country, by = c("name" = "nationality"))
################################################################################
# Plot
################################################################################
averageRatingByCountryPlot <- ggplot() +
  geom_sf(data = world_ratings,
          aes(fill = avg_overall,
              text = paste0(name, "<br>Avg Rating: ", round(avg_overall, 1))),
          color = "darkgray", size = 0.1) +
  scale_fill_viridis_c(option = "D", na.value = "darkgray", name = "Avg Rating") +
  labs(
    title = "Average Overall Rating of Players by Country (FIFA 18)",
    x = NULL, y = NULL
  ) +
  theme_minimal()
## Warning in layer_sf(geom = GeomSf, data = data, mapping = mapping, stat = stat,
## : Ignoring unknown aesthetics: text
ggplotly(averageRatingByCountryPlot, tooltip = "text")
I tried fitting a linear regression model to predict a player’s overall rating based on selected performance attributes.
How can a player’s overall rating be predicted from their individual performance attributes? Pace, shooting, passing, dribbling, defending, and physicality are detailed in the dataset and often correlate with a player’s effectiveness.
The likely top predictors are potential, reactions, and composure.
Objective
The goal is to understand the influence of some skills on a player’s overall rating using a linear regression approach.
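Data Preparation
The response (overall) and eight predictors (potential, dribbling, shot_power, short_passing, composure, reactions, stamina, and strength) were selected from the fifa18 dataset.
Rows with missing values were dropped.
The predictors were standardized so that their coefficients are comparable; the response was left on its original scale.
A linear regression model was fit, and its coefficients were plotted with error bars of one standard error.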
Interpretation
Reactions, potential, and composure seem to be key drivers. This insight can be useful for player development analysis.
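Because the predictors are standardized, each coefficient reads as the expected change in overall rating for a one-standard-deviation increase in that attribute, which is what makes the coefficients comparable across attributes. A minimal sketch of that relationship on simulated data (the variables `x1`, `x2`, and `y` are made up for illustration and are not the FIFA 18 attributes):

```r
set.seed(1)

# Simulated stand-in data: two predictors with the same true slope
# but very different spreads
n <- 200
sim <- data.frame(
  x1 = rnorm(n, mean = 60, sd = 10),  # wide spread
  x2 = rnorm(n, mean = 60, sd = 2)    # narrow spread
)
sim$y <- 20 + 0.5 * sim$x1 + 0.5 * sim$x2 + rnorm(n)

# Fit on the raw predictors
raw_fit <- lm(y ~ x1 + x2, data = sim)

# Fit on standardized predictors (response left on its original scale)
sim_scaled <- sim
sim_scaled$x1 <- as.numeric(scale(sim$x1))
sim_scaled$x2 <- as.numeric(scale(sim$x2))
scaled_fit <- lm(y ~ x1 + x2, data = sim_scaled)

print(coef(raw_fit))
print(coef(scaled_fit))

# Each standardized slope equals the raw slope times the predictor's SD
slopes_match <- all.equal(
  unname(coef(scaled_fit)[-1]),
  unname(coef(raw_fit)[-1] * c(sd(sim$x1), sd(sim$x2)))
)
print(slopes_match)
```

In this sketch the raw slopes are similar, but the standardized slope for `x1` is much larger because `x1` varies more. A large standardized coefficient therefore reflects both the per-point effect and how much the attribute varies across players.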
################################################################################
# Data
################################################################################
# Select predictors and response, then drop missing values
fifa_model_data <- fifa18 %>%
  select(overall, potential, dribbling, shot_power, short_passing,
         composure, reactions, stamina, strength) %>%
  drop_na()
# Standardize predictors (but not the response variable)
# as.numeric() keeps each column a plain vector; scale() alone returns a one-column matrix
fifa_model_scaled <- fifa_model_data %>%
  mutate(across(-overall, ~ as.numeric(scale(.x))))
# Fit the linear model with standardized predictors
model <- lm(overall ~ ., data = fifa_model_scaled)
# Tidy model output
model_summary <- broom::tidy(model)
################################################################################
# Plot
################################################################################
# Plot coefficients
ggplot(model_summary, aes(x = reorder(term, estimate), y = estimate)) +
  geom_point(size = 3, color = "steelblue") +
  geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error),
                width = 0.2, color = "darkgray") +
  coord_flip() +
  labs(
    title = "Predictor Coefficients for Player Overall Rating",
    x = NULL,
    y = "Coefficient Estimate"
  ) +
  theme_minimal()
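One possible refinement of a coefficient plot like this: with standardized predictors the intercept equals the mean of the response, so it is likely far larger than any slope and can compress the rest of the plot. A hedged sketch of a variant that drops the intercept and shows 95% confidence intervals instead of one standard error, using the built-in `mtcars` data as a stand-in (the columns `mpg`, `wt`, `hp`, and `qsec` are not part of the FIFA analysis):

```r
library(dplyr)
library(ggplot2)

# Stand-in data: standardize the predictors, keep the response as-is
cars_scaled <- mtcars %>%
  select(mpg, wt, hp, qsec) %>%
  mutate(across(-mpg, ~ as.numeric(scale(.x))))

fit <- lm(mpg ~ ., data = cars_scaled)

# Tidy output with confidence intervals, intercept removed
coefs <- broom::tidy(fit, conf.int = TRUE) %>%
  filter(term != "(Intercept)")

coef_plot <- ggplot(coefs, aes(x = reorder(term, estimate), y = estimate)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "darkgray") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
                width = 0.2, color = "darkgray") +
  geom_point(size = 3, color = "steelblue") +
  coord_flip() +
  labs(x = NULL, y = "Coefficient Estimate (95% CI)") +
  theme_minimal()

print(coef_plot)
```

The dashed line at zero makes it easy to see which intervals exclude zero. Whether to drop the intercept is a presentation choice and does not change the fitted model.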
Objective
Evaluate the performance of the model by comparing predicted player overall ratings to their actual ratings in the FIFA 18 dataset.
Interpretation
The points closely clustering around the diagonal line suggest that the model performs reasonably well in predicting overall ratings. Some dispersion is visible. The plot indicates that the model is a decent estimator of player ratings using core skill attributes, though there remains some room for refinement. The interactive format makes it easier to inspect a particular player’s rating.
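The visual impression of "reasonably well" can be backed by a few summary numbers. A minimal sketch of common fit metrics on simulated data (the `skill` and `overall` columns here are simulated stand-ins, not the real dataset, and the metrics are computed in-sample):

```r
set.seed(42)

# Simulated stand-in data
n <- 300
sim <- data.frame(skill = rnorm(n, mean = 65, sd = 8))
sim$overall <- 10 + 0.85 * sim$skill + rnorm(n, sd = 3)

fit <- lm(overall ~ skill, data = sim)
pred <- predict(fit, newdata = sim)

# Root mean squared error: typical size of a prediction error, in rating points
rmse <- sqrt(mean((sim$overall - pred)^2))

# Mean absolute error: less sensitive to a few large misses
mae <- mean(abs(sim$overall - pred))

# R-squared: share of variance in the response explained by the model
r2 <- cor(sim$overall, pred)^2

print(c(rmse = rmse, mae = mae, r2 = r2))
```

Applied to `fifa_predictions`, the same three lines would use `overall` and `predicted_overall`. Because the model is evaluated on the data it was fit to, these numbers are optimistic; a held-out test set would give a fairer estimate.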
################################################################################
# Data
################################################################################
# Select variables (keeping name and nationality for the hover text) and drop missing values
fifa_model_data <- fifa18 %>%
  select(name, nationality, overall, potential, dribbling, shot_power, short_passing,
         composure, reactions, stamina, strength) %>%
  drop_na()
# Standardize predictor variables only
fifa_model_scaled <- fifa_model_data %>%
  mutate(across(c(potential, dribbling, shot_power, short_passing,
                  composure, reactions, stamina, strength),
                ~ as.numeric(scale(.x))))
# Fit the model
model <- lm(overall ~ potential + dribbling + shot_power + short_passing +
              composure + reactions + stamina + strength,
            data = fifa_model_scaled)
# Add predictions
fifa_predictions <- fifa_model_scaled %>%
  mutate(predicted_overall = predict(model, newdata = fifa_model_scaled))
################################################################################
# Plot
################################################################################
# Predicted vs Actual, with player details shown on hover
plot_ly(
  data = fifa_predictions,
  x = ~overall,
  y = ~predicted_overall,
  type = "scatter",
  mode = "markers",
  marker = list(color = "#1F968B", opacity = 0.3, size = 7),
  text = ~paste(
    "Name:", name,
    "<br>Nationality:", nationality,
    "<br>Actual:", overall,
    "<br>Predicted:", round(predicted_overall, 1)
  ),
  hoverinfo = "text"
) %>%
  layout(
    title = "Predicted vs Actual Overall Ratings (FIFA 18)<br><sub>Hover for player details</sub>",
    xaxis = list(title = "Actual"),
    yaxis = list(title = "Predicted"),
    # Dashed diagonal: points on this line are predicted exactly
    shapes = list(
      list(
        type = "line",
        x0 = min(fifa_predictions$overall),
        x1 = max(fifa_predictions$overall),
        y0 = min(fifa_predictions$overall),
        y1 = max(fifa_predictions$overall),
        line = list(dash = "dash", color = "#481567")
      )
    ),
    annotations = list(
      list(
        x = 93,
        y = 94,
        text = "Top Player",
        showarrow = TRUE,
        arrowhead = 2,
        ax = -50,
        ay = -40,
        font = list(color = "white", size = 12),
        bgcolor = "#1F968B",
        bordercolor = "gray",
        borderwidth = 1
      )
    )
  )